• Streaming Languages for HPEC / Polymorphic Computer Architectures (PCA)

-  Mapping  challenges 

- R-Stream™ compiler for StreamC and KernelC

-  Streaming  Language  Design  Choices 

-  Thoughts  on  Mapping 

•  Dynamic  Compilation 

-  PCA  objectives  that  dynamic  compilation  helps  meet 

-  Runtime  Compilation  Issues/Technologies 

-  Example  of  How  Dynamic  Compilation  Helps  with  Component  Selection 

-  Insertion  of  New  Code  into  Running  Systems 

-  Approaches  for  using  Dynamic  Compilation  Reliably 

Polymorphic (PCA) Architectures

IMAGINE (Stanford: Rixner et al., 1998)

[Figure: IMAGINE block diagram; legend: ALU Cluster]


High  arithmetic/memory  ratio 

Parallelism

Less control logic

Synchrony

Programmable

Exposed resources


Programming and Mapping Challenge


Other  Examples:  RAW  (MIT),  VIRAM  (Berkeley),  Smart  Memories  (Stanford) 


Reservoir Labs


HPEC2002 


Imagine Performance (Dally et al., 2002)


Applications                      Arithmetic Bandwidth        Application Performance
Stereo Depth Extraction           11.92 GOPS (16-bit)         320x240 8-bit gray scale
MPEG-2 Encoding                   15.35 GOPS (16- and 8-bit)  320x288 24-bit color at 287 fps
QR Decomposition                  10.46 GFLOPS                192x92 matrix decompositions in 1.44 ms
Polygon Rendering                 5.91 GOPS (fp and integer)  35.6 fps for 720x720 "ADVS" benchmark
Polygon Rendering with Real-Time  4.64 GOPS (fp and integer)  16.3M pixels/second,
  Shading Language                                            11.1M vertices/second

Kernels                           Arithmetic Bandwidth        Kernel Performance
Discrete Cosine Transformation    22.6 GOPS (16-bit)          34.8 ns per 8x8 block (16-bit)
7x7 Convolution                   25.6 GOPS (16-bit)          1.5 us per row of 320 16-bit pixels
FFT                               6.9 GFLOPS                  7.4 us per 1,024-point floating-point complex FFT

(6 ALUs * 8 PEs * 400 MHz = 19.2 GOPS peak; 0.15 um; 2.56 cm^2)
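
The peak figure and the FFT row above can be cross-checked with a few lines of arithmetic. This is only a sanity check: the 5·N·log2(N) operation count for a complex FFT is a common convention assumed here, since the slide does not say which count was used.

```python
import math

# Peak rate from the slide's footnote: 6 ALUs per PE, 8 PEs, 400 MHz clock.
ALUS_PER_PE = 6
NUM_PES = 8
CLOCK_HZ = 400e6

peak_gops = ALUS_PER_PE * NUM_PES * CLOCK_HZ / 1e9

# FFT row: one 1,024-point complex FFT in 7.4 microseconds.
# Assumption: 5 * N * log2(N) floating-point operations per transform.
N = 1024
fft_flops = 5 * N * math.log2(N)
fft_gflops = fft_flops / 7.4e-6 / 1e9

print(f"peak: {peak_gops:.1f} GOPS")
print(f"FFT:  {fft_gflops:.1f} GFLOPS")
```

Under that assumption the FFT time and the quoted GFLOPS rate agree to one decimal place.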

Compiler Mapping Challenges


Mapping  Requirements 

Identify  Parallelism 
Partition  Program 
Select  Operators 
Place  data 
Layout  data 
Place  computation 
Schedule  computation 


Resources 

Parallel  functional  units 
Parallel  tiles 
Distributed  memories 
Local  bypass  wires 
Inter-tile  wires 


Constraints 

Functional  unit  availability 
Distributed  register  files 
Small  bounded  memories 
Small  or  no  queues 
Partial  interconnect 


Increasing  the  arithmetic/memory  ratio  while  holding  silicon  area  fixed 
simultaneously  increases  resources  and  tightens  constraints. 

=> Automatic compilation is MUCH harder.

=> Place some burden on the programmer and invest in compilers.
=> Introduce new languages to assist the programmer.

KernelC / StreamC Languages (Stanford: Mattson 2001)


KernelC Example

    kernel SAXPY(
        istream<float> x_s,
        istream<float> y_s,
        ostream<float> result_s,
        float a)
    {
        loop_stream(x_s) {
            float x, y, result;
            x_s >> x;
            y_s >> y;
            result = a * x + y;
            result_s << result;
        }
    }

StreamC Example

    streamprog testSAXPY(String args) {
        const int testSize = 128;
        stream<float> x_s(testSize);
        stream<float> y_s(testSize);
        stream<float> result_s(testSize * 2);
        stream<float> evenResult_s =
            result_s(0, testSize, STRIDE, 2);
        stream<float> oddResult_s =
            result_s(1, testSize, STRIDE, 2);
        // initialize x_s and y_s
        ...
        // compute
        SAXPY(x_s, y_s, evenResult_s, 3.4);
        SAXPY(x_s, y_s, oddResult_s, 6.7);
    }

Make  High  Level  Dataflow  and  Low  Level  Parallelism  Explicit 
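
For readers without an Imagine toolchain, the dataflow of the SAXPY example can be emulated in ordinary Python. This is a rough sketch, not StreamC semantics: lists stand in for streams, slice assignment stands in for the stride-2 derived streams, and the input values are made up because the slide elides the initialization.

```python
def saxpy_kernel(x_s, y_s, a):
    """Emulates the kernel loop: read one x and one y, write a * x + y."""
    return [a * x + y for x, y in zip(x_s, y_s)]

TEST_SIZE = 128

# Hypothetical input data; the slide only says "initialize x_s and y_s".
x_s = [float(i) for i in range(TEST_SIZE)]
y_s = [1.0] * TEST_SIZE
result_s = [0.0] * (TEST_SIZE * 2)

# evenResult_s and oddResult_s are stride-2 views of result_s
# starting at offsets 0 and 1 respectively.
result_s[0::2] = saxpy_kernel(x_s, y_s, 3.4)
result_s[1::2] = saxpy_kernel(x_s, y_s, 6.7)

print(result_s[:4])
```

The two kernel calls write interleaved halves of one buffer, which is the high-level dataflow the stream program makes explicit.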

R-Stream™: Compiling KernelC and StreamC


Compiling KernelC
•Replicate kernel across clusters for SIMD control.

•Generate conditional stream I/O operations for data-driven dynamic behavior.
•Modulo-schedule the kernel to get a tight loop.

•Explicitly manage communication resources.

=> Output is an Imagine SIMD control program.


Compiling StreamC
•Inline/flatten the program into a kernel call sequence.

•Constant-propagate so the strip size is explicit.

•Allocate the Stream Register File (local memory) with a process like register allocation.

=> Output is a C++ program for the Imagine controller that generates stream initiators and synchronization calls.
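
The analogy between Stream Register File allocation and register allocation can be sketched as a first-fit allocator over stream live ranges. This is an illustrative simplification, not R-Stream's algorithm; the `Stream` record, the stream names, and the capacity are all hypothetical.

```python
from typing import NamedTuple

class Stream(NamedTuple):
    name: str
    size: int    # words needed in the SRF
    first: int   # index of the first kernel call that touches the stream
    last: int    # index of the last kernel call that touches the stream

def allocate_srf(streams, capacity):
    """First-fit allocation over live ranges, in the spirit of linear-scan
    register allocation. Returns {name: offset}; raises if a stream cannot fit."""
    placed = []   # (offset, stream) pairs already allocated
    offsets = {}
    for s in sorted(streams, key=lambda s: s.first):
        # Address ranges held by streams whose live range overlaps this one.
        busy = sorted((off, off + p.size) for off, p in placed
                      if not (p.last < s.first or s.last < p.first))
        offset = 0
        for lo, hi in busy:
            if offset + s.size <= lo:
                break                 # the gap before this busy range is big enough
            offset = max(offset, hi)
        if offset + s.size > capacity:
            raise MemoryError(f"{s.name} does not fit in the SRF")
        placed.append((offset, s))
        offsets[s.name] = offset
    return offsets

# Hypothetical two-kernel sequence: 'tmp0' dies after call 0, so 'tmp1' reuses its space.
streams = [
    Stream("x", 128, 0, 1),
    Stream("y", 128, 0, 1),
    Stream("tmp0", 128, 0, 0),
    Stream("tmp1", 128, 1, 1),
]
offsets = allocate_srf(streams, capacity=384)
print(offsets)
```

Because the strip size is a compile-time constant after constant propagation, the sizes fed to such an allocator are known, which is what makes the register-allocation analogy workable.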


R-Stream™  can  address  the  mapping  challenge 
posed  by  these  new  architectures. 

StreamC and KernelC Language: Observations


•  Emphasis  is  on  letting  programmer  express  partitioning  and  parallelism. 

KernelC is like a new assembly language (e.g., modulo scheduling now moves
to the assembler). The major development challenges for compilation,
determining partitioning and mapping, are at the level of the StreamC
compiler.


• Nevertheless, there is interaction between the StreamC and KernelC compilers,
which leads to some hand-holding of the compiler by the programmer. Some
of this is due to architectural problems (e.g., spilling scratch registers) and
some is unavoidable phase interaction. Should they be fused?


• C++ syntax: with special templates and libraries, programs can be compiled by a
generic C++ compiler (e.g., Microsoft Visual C++) for code development.


• Expression of conditional stream I/O operations allows data-dependent data
motion, and maps directly to special hardware on Imagine or can be
synthesized with software operations (Kapasi et al., 2000).

Streaming Language Design Choices


Imperative (e.g., StreamC, Brook [Stanford])

kernel1(..., streamB);
kernel2(streamB, ...);

Easy to represent state, control flow, finite streams
Good for describing array-based algorithms (stride)

Probably maps well to conventional DSPs


Dataflow (e.g., StreamIt [MIT], Streams-C [LANL])

kernel1.output1.connect(streamB);
kernel2.input1.connect(streamB);

go();

Easy  to  represent  task  parallelism,  infinite  streams 
Good  for  describing  continuously-running  systems 


The history of streaming languages is long.

These programs could easily be transformed into FORTRAN.

The challenge is how to map! This gets much harder with dynamic behavior.

There are other things we might need to express (e.g., flexibility in
ordering semantics, periodic invocation), which arise from the nature of
embedded systems (Lee 2002).
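
The two styles can be contrasted on a two-kernel pipeline. The sketch below is a toy model, not StreamC or StreamIt: the `Node` class, its `connect` and `fire` methods, and the `go` scheduler are hypothetical names chosen to mirror the slide's snippets.

```python
from collections import deque

# Imperative style: the program orders the kernel calls and owns streamB.
def kernel1(xs):
    return [2 * x for x in xs]

def kernel2(xs):
    return [x + 1 for x in xs]

stream_b = kernel1([1, 2, 3])
imperative_out = kernel2(stream_b)

# Dataflow style: nodes are wired together, then a scheduler runs the graph.
class Node:
    def __init__(self, fn):
        self.fn = fn
        self.inbox = deque()
        self.downstream = None

    def connect(self, other):
        self.downstream = other

    def fire(self):
        """Consume one item if available and forward the result."""
        if not self.inbox:
            return False
        y = self.fn(self.inbox.popleft())
        if self.downstream is not None:
            self.downstream.inbox.append(y)
        return True

def go(nodes):
    """Fire any ready node until no node has pending input."""
    while any(n.fire() for n in nodes):
        pass

dataflow_out = []
n1 = Node(lambda x: 2 * x)
n2 = Node(lambda x: x + 1)
sink = Node(dataflow_out.append)
n1.connect(n2)
n2.connect(sink)
n1.inbox.extend([1, 2, 3])
go([n1, n2, sink])

print(imperative_out, dataflow_out)
```

In the imperative version the call order fixes the schedule; in the dataflow version the scheduler is free to interleave node firings, which is why that style suits task parallelism and unbounded streams.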

Thoughts on Mapping


•  Key  technologies  for  compilation  for  PCA: 

-  Optimizations  that  incorporate  constraints  of  small  bounded  memories. 

-  Optimizations  that  incorporate  communications  bandwidth  constraints. 

-  Optimizations  that  incorporate  non-traditional  objectives  (e.g.,  latency, 
power). 

•  Probably: 

-  Will  see  increasing  fusion  of  compiler  phases  to  allow  fine-grained 
tradeoffs  between  parallelism  and  storage  use. 

- Will increasingly see compilation implemented as search/solvers over
constraint systems.

- New program intermediate representation forms, such as Array-SSA
(Knobe 1998) or the PDG, will invigorate development of these key
technologies.
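
As a toy illustration of compilation as search over a constraint system, the sketch below exhaustively assigns kernels to tiles, rejects assignments that exceed a per-tile memory bound, and minimizes inter-tile traffic. The kernel names, sizes, and traffic weights are invented, and a real mapper would use a solver rather than brute force.

```python
from itertools import product

def best_mapping(mem_need, edges, num_tiles, tile_capacity):
    """Exhaustively search kernel-to-tile assignments.

    mem_need: {kernel: words}; edges: {(src, dst): words transferred}.
    Returns (cost, mapping) with the least inter-tile traffic among the
    assignments that respect the memory bound, or None if none is feasible."""
    kernels = sorted(mem_need)
    best = None
    for tiles in product(range(num_tiles), repeat=len(kernels)):
        mapping = dict(zip(kernels, tiles))
        used = [0] * num_tiles
        for k, t in mapping.items():
            used[t] += mem_need[k]
        if max(used) > tile_capacity:
            continue   # violates the small-bounded-memory constraint
        cost = sum(w for (a, b), w in edges.items() if mapping[a] != mapping[b])
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best

# Hypothetical four-kernel pipeline on two tiles of six words each.
mem_need = {"fir": 3, "fft": 3, "mag": 2, "det": 2}
edges = {("fir", "fft"): 8, ("fft", "mag"): 8, ("mag", "det"): 1}
cost, mapping = best_mapping(mem_need, edges, num_tiles=2, tile_capacity=6)
print(cost, mapping)
```

The memory bound forces the pipeline to be split even though co-locating everything would cost nothing, which is the parallelism-versus-storage tradeoff the slide refers to.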

Handling Dynamic Behavior Looks Difficult


Dynamic  Behaviors 
•Data-dependent  iteration 
counts,  strides,  access  patterns. 
•Conditionals  within  loops. 

•Task  parallelism. 

•External  dynamic  conditions: 
mission  changes,  problem 
arrival,  etc. 

•Resource  changes  (hardware 
failures). 

•Changes  to  application  (e.g., 
dynamic  component 
replacement). 

•Morphing. 


Characterize  by: 

Size  of  change 

Time  constant  of  change 

Space  of  configurations 


Use  existing  technologies  and  tricks 
for  parallel  dynamic  load  balancing, 
throttling,  mapping,  scheduling  ... 
But  with  the  additional  challenge  of 
greater  degrees  of  parallelism  and 
more  severe  resource  constraints. 


If  finding  a  static  mapping  for 
PCA  hardware  is  so  hard,  how 
can  we  expect  to  achieve  the 
same  result  with  an  online, 
distributed,  and  low-overhead 
mapper? 

Dynamic Compilation: Can it Apply to HPEC?


PCA Technology Program Objective (Subset)  ->  Dynamic Compilation Capability (as demonstrated by R-JIT™ for Java)

• Hardware unknown until mission (e.g., failures, other resident apps).
  -> Optimization at mission initiation.

• Mission parameters (#targets, etc.) unknown until runtime.
  -> Runtime profiling and profile-driven optimization.

• Runtime assembly and selection of library components.
  -> Cross-module optimization at runtime simultaneously with class loading.

• Desire for a morphware "virtual machine".
  -> Java virtual machine.

• Changing mission parameters, objectives, and system resources at runtime.
  -> Runtime recompilation and insertion of new code into running systems.

Runtime Compilation Issues / Technologies


•  How  can  a  compiler  be  made  fast  &  resource-efficient  enough  to  work 
effectively  at  runtime? 

-  Efficient  data-structures  and  IR. 

-  Integrated  optimizations:  do  multiple  mapping  and  optimization  tasks  in 
the  same  phase,  limit  phase  iterations. 

-  Apply  optimization  selectively  to  high-priority  sections. 

-  Sequence  from  light  to  heavy  optimization  based  on  priority. 

• Nevertheless, the heavy parallelism and tightly constrained nature of PCA
architectures present a new level of compilation challenge to perform
at runtime.

-  Optimizing  application  might  take  hours  or  more  (e.g.,  solving  complex 
constraint  system  using  integer  programming). 

- We can start by applying the Java strategies to these new compiler problems.

•  E.g.,  achieve  variation  in  optimization  intensity  for  modulo  scheduling  by 
searching  first  for  schedules  in  looser  initiation  intervals. 

-  New  approaches 

•  E.g.,  pre-analyzed  or  partially  compiled  code  “finished”  at  runtime. 
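
The idea of varying optimization intensity by searching looser initiation intervals (II) first can be sketched with a small greedy modulo scheduler. This is a simplified illustration under stated assumptions: the loop body has no loop-carried dependences, the operation table and unit counts are hypothetical, and "loose" is arbitrarily taken to be twice the resource lower bound.

```python
import math

def res_mii(ops, units):
    """Resource-constrained lower bound on the initiation interval."""
    uses = {}
    for res, _lat, _preds in ops.values():
        uses[res] = uses.get(res, 0) + 1
    return max(math.ceil(n / units[r]) for r, n in uses.items())

def try_schedule(ops, units, ii):
    """Greedy modulo scheduling at a fixed II; returns {op: start cycle} or None.
    ops maps name -> (resource, latency, predecessors), in topological order."""
    table = {}   # (resource, cycle mod II) -> units already in use
    start = {}
    for name, (res, _lat, preds) in ops.items():
        earliest = max((start[p] + ops[p][1] for p in preds), default=0)
        for t in range(earliest, earliest + ii):
            slot = (res, t % ii)
            if table.get(slot, 0) < units[res]:
                table[slot] = table.get(slot, 0) + 1
                start[name] = t
                break
        else:
            return None   # every modulo slot for this resource is full
    return start

def search(ops, units, max_attempts):
    """Try loose IIs first, then tighten toward the lower bound while the
    attempt budget lasts. Returns (ii, schedule) for the tightest success."""
    lo = res_mii(ops, units)
    best = None
    for ii in list(range(2 * lo, lo - 1, -1))[:max_attempts]:
        sched = try_schedule(ops, units, ii)
        if sched is None:
            break
        best = (ii, sched)
    return best

# Hypothetical SAXPY-like loop body: one memory port, two ALUs.
ops = {
    "load_x": ("mem", 1, []),
    "load_y": ("mem", 1, []),
    "mul":    ("alu", 2, ["load_x"]),
    "add":    ("alu", 1, ["mul", "load_y"]),
    "store":  ("mem", 1, ["add"]),
}
units = {"mem": 1, "alu": 2}
loose = search(ops, units, max_attempts=1)    # small budget: accept a loose II
tight = search(ops, units, max_attempts=10)   # larger budget: reach the bound
print(loose[0], tight[0])
```

With a budget of one attempt the compiler settles for a valid but slow loop; with more time it tightens the II. In a realistic scheduler, attempts near the lower bound are the expensive ones (recurrences, register pressure, backtracking), so the budget matters more than it does in this toy.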




Dynamic  Component  Selection:  Two  Possible  Approaches 


Component  Assemblies 

•Collections  of  precompiled 
components  (different 
problem  sizes,  algorithms). 

•Decorated with meta-information (XML) or
selection control scripts.

•Loader mediates meta-information and constraints
to select component.


Object-Oriented  Approach  (Java/CLOS) 

•Component  selection  is  handler  selection. 
•Loader  and  compiler  tightly  integrated. 
•Express  meta-information  within  the 
application  in  the  application  language. 
•Static  conditions  at  call  site  inferred  from 
program  context. 

•Dynamic conditions at call site handled
using runtime code generation (e.g.,
profile-motivated speculation and generation of
call-site predicates).

•Expression  within  programming  language 
allows  seamless  inter-module  analysis  and 
optimization. 


The  object-oriented  and  dynamic  compilation 
approach  has  structured  and  tested  ways  of 
handling  this.  Advantages  in  engineering 
economy,  simplicity,  and  reliability. 
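
The two approaches can be contrasted in a few lines. Everything here is hypothetical: the `Component` metadata fields, the catalog entries, and the problem classes are invented, and Python's `functools.singledispatch` merely stands in for Java or CLOS handler selection.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Callable

# Approach 1: component assembly. A loader picks among precompiled
# components by matching metadata against runtime constraints.
@dataclass
class Component:
    name: str
    max_size: int       # largest problem size this build supports
    memory_words: int   # scratch memory it needs
    run: Callable

def select(components, problem_size, free_memory):
    """Choose the smallest-footprint component that satisfies the constraints."""
    ok = [c for c in components
          if problem_size <= c.max_size and c.memory_words <= free_memory]
    if not ok:
        raise LookupError("no component satisfies the constraints")
    return min(ok, key=lambda c: c.memory_words)

catalog = [
    Component("small_sort", max_size=256, memory_words=512, run=sorted),
    Component("large_sort", max_size=4096, memory_words=8192, run=sorted),
]

# Approach 2: object-oriented. Selection is ordinary handler dispatch,
# expressed in the application language itself.
class SmallProblem(list):
    pass

class LargeProblem(list):
    pass

@singledispatch
def solve(problem):
    raise TypeError("no handler for this problem type")

@solve.register
def _(problem: SmallProblem):
    return "small handler", sorted(problem)

@solve.register
def _(problem: LargeProblem):
    return "large handler", sorted(problem)

print(select(catalog, 100, 1000).name, solve(SmallProblem([3, 1, 2])))
```

In the second form the selection logic is visible to the compiler as an ordinary call, which is what would let a dynamic compiler specialize or inline across the component boundary.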

Insertion of New Code into Running Systems


•  The  challenge  of  dynamic  class  loading  in  Java,  and  how  to 
implement: 

-  Class  hierarchy  in  a  Java  application  is  not  static. 

-  For  competitive  performance,  JIT  must  perform  inter-module  optimizations 
using  the  speculation  that  class  hierarchy  is  static. 

-  The  action  of  a  dynamic  class  load  invalidates  that  speculation. 

- How to transform in-progress invocations (thread queue, call stack)?

•  How  do  we  transfer  PC? 

•  How  do  we  transform  optimized  state? 

- One approach (Sun's HotSpot) is a dynamic decompilation facility.

•  Hairy,  slow,  complex  semantics.  How  to  validate? 

-  Reservoir’s  approach:  pseudo-conditionals  and  simple  code-rewriting. 

•  These  technologies  can  apply  to  PCA 

-  To  “morph”  to  pre-compiled  configurations  over  small  mission  sets. 

-  To  generate  “sockets”  for  morphs  to  runtime  generated  configurations  for 
large  or  open  mission  sets. 
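
The speculation-and-invalidation problem can be illustrated with a guarded call site. This is a conceptual sketch only; the slides do not describe how Reservoir's pseudo-conditionals are implemented, and the `CallSite` and `Runtime` classes here are hypothetical. It also sidesteps the hard part named above, namely invocations already in progress.

```python
class CallSite:
    """A call site compiled under the speculation that only one class
    implements `area`. The guard is true until a class load rewrites it."""
    def __init__(self, speculated_cls):
        self.speculated_cls = speculated_cls
        self.fast_path = speculated_cls.area   # direct, 'inlined' target
        self.guard_valid = True

    def invoke(self, obj):
        if self.guard_valid:
            return self.fast_path(obj)   # no dynamic dispatch
        return obj.area()                # general virtual call

class Runtime:
    def __init__(self):
        self.sites = []

    def compile_site(self, cls):
        site = CallSite(cls)
        self.sites.append(site)
        return site

    def load_class(self, new_cls):
        # A new subclass that overrides the method invalidates the speculation.
        for site in self.sites:
            if issubclass(new_cls, site.speculated_cls) and "area" in vars(new_cls):
                site.guard_valid = False

class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

runtime = Runtime()
site = runtime.compile_site(Square)
before = site.invoke(Square(2))   # takes the speculated fast path

class Rect(Square):
    def __init__(self, w, h):
        super().__init__(w)
        self.h = h

    def area(self):
        return self.side * self.h

runtime.load_class(Rect)          # dynamic class load flips the guard
after = site.invoke(Rect(2, 3))   # falls back to virtual dispatch
print(before, after)
```

Without the guard flip, the second call would run `Square.area` on a `Rect` and return the wrong answer, which is why the class load must rewrite every site that depended on the speculation.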

Reliability of Dynamic Compilation


Concerns about reliability are justifiable:

•The complexity of dynamic compiler transformations exceeds
the complexity of modern dynamic instruction issue hardware.

•In practice, Reservoir has found that other leading commercial
release JVMs fail to pass our random Java test suite (R-JVV™).

Reservoir's Approach to Dynamic Compilation Testing

•Intense  application  of  randomly  generated 
tests. 

•IR  maintenance  of  extra  information  to 
detect  optimization  failures,  with  graceful 
recovery. 

•Coverage  analysis  increases  proportion  of 
compiler  that  is  exercised. 

•Limit  deployment  to  a  single  application. 

=> Reservoir's mainframe system R-DYN™ deployment with only one bug found in 4
years (Bank 2001)


Still, much R&D is required, specifically in compiler validation:
e.g., model-driven testing, compiler specification checking,
proof-carrying code.
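
Random testing of a compiler is commonly done differentially: generate random programs, run them with and without the optimizer, and compare results. The sketch below applies that idea to a toy expression optimizer; it is not R-JVV, and the expression format, the `optimize` pass, and the `fuzz` driver are all invented for illustration.

```python
import random

def gen(rng, depth):
    """Random expression tree over +, -, * with integer leaves and a variable 'x'."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", rng.randint(-9, 9)])
    return (rng.choice("+-*"), gen(rng, depth - 1), gen(rng, depth - 1))

def evaluate(e, x):
    """Reference interpreter: the unoptimized semantics."""
    if isinstance(e, str):
        return x
    if isinstance(e, int):
        return e
    op, a, b = e
    a, b = evaluate(a, x), evaluate(b, x)
    return a + b if op == "+" else a - b if op == "-" else a * b

def optimize(e):
    """Constant folding plus the identities e*0 -> 0, e*1 -> e, e+0 -> e, e-0 -> e."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], optimize(e[1]), optimize(e[2])
    if isinstance(a, int) and isinstance(b, int):
        return evaluate((op, a, b), 0)
    if op == "*" and (a == 0 or b == 0):
        return 0
    if op == "*" and a == 1:
        return b
    if op == "*" and b == 1:
        return a
    if op == "+" and a == 0:
        return b
    if op in "+-" and b == 0:
        return a
    return (op, a, b)

def fuzz(trials, seed=0):
    """Return the first expression whose optimized form disagrees, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        e = gen(rng, 4)
        x = rng.randint(-5, 5)
        if evaluate(optimize(e), x) != evaluate(e, x):
            return e
    return None

print(fuzz(2000))
```

A deliberately wrong rule (for example rewriting `0 - e` to `e`) is the kind of bug such a harness tends to surface quickly, and the returned expression serves as the failing test case.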

Summary


• New programmable DSP architectures can achieve performance near that of
fixed-function hardware (e.g., Imagine, 20 GOPS).

• ... but introduce major compiler challenges!

• New streaming languages can help.

• Automated mapping for multiple cores/distributed memories is a critical
research area.

• Advancing automatic mapping technology for dynamic compilation
has the potential to solve other important problems.